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Abstract 

Background: A key goal of systems biology and translational genomics is to utilize high-throughput measurements 
of cellular states to develop expression-based classifiers for discriminating among different phenotypes. Recent 
developments of Next Generation Sequencing (NGS) technologies can facilitate classifier design by providing 
expression measurements for tens of thousands of genes simultaneously via the abundance of their mRNA 
transcripts. Because NGS technologies result in a nonlinear transformation of the actual expression distributions, their 
application can result in data that are less discriminative than would be the actual expression levels themselves, were 
they directly observable. 

Results: Using state-of-the-art distributional modeling for the NGS processing pipeline, this paper studies how that 
pipeline, via the resulting nonlinear transformation, affects classification and feature selection. The effects of different 
factors are considered and NGS-based classification is compared to SAGE-based classification and classification 
directly on the raw expression data, which is represented by a very high-dimensional model previously developed for 
gene expression. As expected, the nonlinear transformation resulting from NGS processing diminishes classification 
accuracy; however, owing to a larger number of reads, NGS-based classification outperforms SAGE-based classification. 

Conclusions: Having high numbers of reads can mitigate the degradation in classification performance resulting 
from the effects of NGS technologies. Hence, when performing a RNA-Seq analysis, using the highest possible 
coverage of the genome is recommended for the purposes of classification. 



Background 

In recent years, modern high throughput sequencing 
technologies have become one of the essential tools in 
measuring the number of transcripts of each gene in a cell 
population or even in individual cells. Such information 
could be used to detect differential gene expression due 
to different treatment or phenotype. In our case we are 
interested in using gene-expression measurements to clas- 
sify phenotypes into one of two classes. The accuracy of 
classification will depend on the manner in which the phe- 
nomena are transformed into data by the measurement 
technology. We consider the effects of Next-Generation 
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Sequencing (NGS) and Serial Analysis of Gene Expression 
(SAGE) on gene-expression classification using currently 
accepted measurement modeling. The accuracy of clas- 
sification problem has previously been addressed for the 
LC-MS proteomics pipeline, where state-of-the-art mod- 
eling is more refined, the purpose being to character- 
ize the effect of various noise sources on classification 
accuracy [1]. 

NGS technology provides a discrete counting measure- 
ment for gene-expression levels. In particular, RNA-Seq 
sequences small RNA fragments (mRNA) to measure 
gene expression. When a gene is expressed, it produces 
mRNAs. The RNA-Seq experiment randomly shears and 
converts the RNA fragments to cDNAs, sequences them, 
and finally outputs the results in the form of short reads 
[2,3]. After obtaining those reads, a typical part of a 
processing pipeline is to map them back to a reference 



O© 2013 Ghaffari etal.; licensee BioMed Central ftd. This is an Open Access article distributed under the terms of the Creative 
BlOlVlGCl C^ntrd! Commons Attribution License (http://creativecommons.Org/licenses/by/2.0), which permits unrestricted use, distribution, and 
reproduction in any medium, provided the original work is properly cited. 



Ghaffari etal. BMC Bioinformatics 2013, 14:307 
http://www.biomedcentral.eom/1 471 -21 05/1 4/307 



Page 2 of 14 



genome to determine the gene-expression levels. The 
number of reads mapped to a gene on the reference 
genome defines the count data, which is a discrete mea- 
sure of the gene-expression levels. Two popular models 
for statistical representation of the discrete NGS data are 
the negative binomial [4,5] and Poisson [6]. The nega- 
tive binomial model is more general because it can mit- 
igate over-dispersion issues associated with the Poisson 
model; however, with the relatively small number of sam- 
ples available in most current NGS experiments, it is 
difficult to accurately estimate the dispersion parameter 
of the negative binomial model. Therefore, in this study 
we choose to model the NGS data processing pipeline 
through the transformation via a Poisson model, for the 
purposes of phenotype classification. 

SAGE technology produces short continuous sequences 
of nucleotides, called tags. After a SAGE experiment is 
done, one can measure the expression level of a particu- 
lar region/gene of interest on the genome by counting the 
number of tags that map to it. SAGE is very similar to 
RNA-Seq in nature and in terms of statistical modeling. 
The SAGE data processing pipeline is traditionally mod- 
eled as a Poisson random vector [7,8]. We follow the same 
approach for synthetically generated SAGE-like data sets. 

Our overall methodology is to generate three different 
types of synthetic data: (1) actual gene expression con- 
centration, called MVN-GC, from a multivariate normal 
(Gaussian) model formulated to model various aspects 
of gene expression concentration [9]; (2) Poisson trans- 
formed MVN-GC data, called NGS-reads, with spec- 
ifications that resemble NGS reads; and (3) Poisson 
transformed MVN-GC data, called SAGE-tags, where the 
characteristics of the data model SAGE data. The classifi- 
cation results related to these three different types of data 
sets indicate that MVN-GC misclassification errors are 
lower compared to data subjected to transformations that 
produce either NGS-reads or SAGE-tags data. Moreover, 
classification using RNA-Seq synthetic data outperforms 
classification using SAGE data when the number of reads 
is in an acceptable range for an RNA-Seq experiment. 
The better performance is attributed to the significantly 
higher genome coverage associated with the RNA-Seq 
technology. 

Next-generation sequencing technologies 

Next-Generation Sequencing refers to a class of tech- 
nologies that sequence millions of short DNA fragments 
in parallel, with a relatively low cost. The length and 
number of the reads differ based on the technology. 
Currently, there are three major commercially available 
platforms/technologies for NGS: (1) Illumina, (2) Roche, 
and (3) Life Technologies [10,11]. The underlying chem- 
istry and technique used in each platform is unique and 
affects the output. In this paper, we focus on Illumina 



sequencers and use the NGS term to refer to this plat- 
form. High-throughput sequencing and NGS are used 
interchangeably. 

The specific application of NGS for RNA sequenc- 
ing is called RNA-Seq [2], which is a high- throughput 
measurement of gene-expression levels of thousands of 
genes simultaneously as represented by discrete expres- 
sion values for regions of interest on the genome 
(e.g. genes). NGS has many advantages when compared 
to the available microarray expression platforms. NGS 
does not depend on prior knowledge about regions of 
expression to measure a gene [11], whereas the microar- 
ray probes are designed based on known sequences [12]. 
The background correction, probe design and spot filter- 
ing, which are typical for microarray-based technology, 
are no longer problematic due to the different nature of 
NGS technology. RNA-Seq enables scientists to discover 
new splice junctions [13], due to its flexibility and inde- 
pendence of the pre-designed probes. The prediction of 
absolute copy-number variation (CNV) is another great 
advantage of RNA-Seq, which allows scientists to iden- 
tify large segments of deletions/duplications in a genome 
with respect to another (a reference) genome [14]. Detect- 
ing single-nucleotide polymorphisms (SNP) is yet another 
application of RNA-Seq [3,15]. Furthermore, it has been 
shown that RNA-Seq can detect spike-in RNA controls 
with good accuracy and reproducibility [16]. 

The RNA-Seq process starts with samples that are ran- 
domly sheared to generate millions of small RNA frag- 
ments. These fragments are then converted to cDNA 
and the adapter sequences are ligated to their ends. The 
length of a fragment can vary between 30 bp - 110 bp, 
approximately. The Illumina system provides flowcells 
for sequencing by synthesis and its reversible terminator 
chemistry [2,3,10]. A flowcell is an eight-channel glass and 
each channel is commonly referred to as a lane. The size 
selected fragments with the adapters attach to the flow- 
cell surface inside the lanes and generate clusters of the 
same fragment through bridge amplification. Following 
the bending and attaching of both sides of the fragment to 
the surface, the strand duplicates. This process is repeated 
many times and results in a cluster of fragments. After 
the cluster generation step, a pool of floating nucleotides 
is added to the flowcell along with DNA polymerase to 
incorporate to the single strand fragments in each cluster. 
Each nucleotide incorporation makes a unique fluores- 
cent label and the images are captured after the addition. 
Finally, image processing and base calling determine the 
base at each cycle of the sequencing. Each cluster pro- 
duces a read whose length equals the number of cycles. 
Each RNA-Seq experiment produces millions of reads 
depending on the input RNA material, length of the 
reads, desired coverage for the reference genome, num- 
ber of samples per lane, etc. Following the sequencing 
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experiment, the expression levels of the genes are esti- 
mated by mapping the reads to the reference genome. 
There are many algorithms developed for this task, includ- 
ing: ELAND [3], Bowtie [17], BWA [18], MAQ [19], 
SHRiMP [20], mrFast [14], mrsFAST [21], SOAP [22], 
etc. After the gene expressions are determined, they can 
be used in further analysis, such as SNP detection or 
detecting differentially expressed genes. 

The entire RNA-Seq sample processing pipeline from 
generating reads to calling gene expressions can involve 
different sources of error, e.g. the basecalling is a proba- 
bilistic procedure and the quality scores assigned to each 
base of the reads are prone to small likelihoods of being 
wrong. The certainty of reference genomes and mapping 
algorithms are additional issues that need attention. Thus, 
the entire RNA-Seq sample processing pipeline should be 
considered from a probabilistic point of view. In this study, 
we model the above mentioned errors as a noise term in 
the model of the sample processing pipeline. 

Discussion 

In this section, we consider three specific models for 
"actual" gene expression data, RNA-Seq count data and 
SAGE tags data. The models are used to synthetically 
generate data for the simulation experiments described 
in this paper. The performances of different classification 
schemes are analyzed and compared across these three 
synthetically generated types of data. 

A common assumption for modeling of the original 
mRNA expressions is that they follow a multivariate 
Gaussian distribution [9,23,24]. Starting with this assump- 
tion, we model and generate the RNA-Seq and SAGE data 
by applying a specific nonlinear Poisson transformation 
to the mRNA expression model. All data are syntheti- 
cally generated according to a biologically relevant model 
to emulate the real experimental situations where the 
number of features/genes is very large, usually tens of 
thousands, and only a small number of sample points is 
available. Knowing the full distributional characteristics of 
the synthetic data makes it possible to measure the classi- 
fication performance as described by the respective error 
rates. 

MVN-GC model 

The model proposed in [9] uses a block-based struc- 
ture on the covariance matrix, which is a standard tool 
to model groups of interacting variables where there is 
negligible interaction between the groups. In genomics, 
genes within a block represent a distinct pathway and 
are correlated, whereas genes not in the same group are 
uncorrelated [25,26]. Sample points are drawn randomly 
and independently from two equally likely classes, 0 and 1, 
each sharing the same D features. There are also c equally 
likely subclasses in class 1 with different parameters for 



the probability distribution of the features. The subclasses 
model scenarios typically seen as different stages or sub- 
types of a cancer. Each sample point in class 1 belongs to 
one and only one of these subclasses. Features are catego- 
rized into two major groups: markers and non-markers. 
Markers resemble genes associated with a disease or 
condition related to the disease and they have different 
class-conditional distributions for the two classes. 

Markers can be further categorized into two different 
groups: global and heterogeneous markers. Global mark- 
ers take on values from D gm -dimensional Gaussian distri- 
butions with parameters (/Xq" 1 , E^" 1 ) for the sample points 
from class 0 and (fif m , E^ m ) for the points from class 1. 
Heterogeneous markers, on the other hand, are divided 
into c subgroups of size Dh m , each associated with one 
of c mutually exclusive subclasses within class 1. There- 
fore, a sample point that belongs to a specific subclass 
has -Dh m heterogeneous markers distributed as a Dh m - 
dimensional Gaussian with parameters (fi\ m , Sj™). The 
same Dhm heterogeneous markers for the sample points 
belonging to other subclasses, as well as points in class 0, 
follow a different Dh m -dimensional Gaussian distribution 
with parameters (p^ m , Eg m ). We assume that the global 
and heterogeneous markers have similar structure for the 
covariance matrices. Therefore, we represent the covari- 
ance matrices of these two types of markers by En = <Jq E 
and Ei = er^E for class 0 and class 1, respectively, where 
ffg and (Tj 2 can be different. For this structure, we assume 
that E has the following block structure: 



E = 



E p 0 0 
0 E p 0 



0 0 



0 
0 



0 E 



p J 



where Ep is an / x / matrix, with 1 on the diagonal 
and p off the diagonal. Note that the dimension of E 
is different for the global and heterogeneous markers. 
Furthermore, we assume that the mean vectors for the 
global markers and the heterogeneous markers possess 
the same structure denoted by jio = mo x (1, 1, ... , 1) 
and j^i = m\ x (1, 1, . . . , 1) for class 0 and class 1, respec- 
tively, where mo and m\ are scalars. The non-markers 
are also divided into two groups: high-variance and low- 
variance non-markers. The Dh v non-markers belonging to 
the high-variance group are uncorrelated and their distri- 
butions are described by pN(mo, cTq) + (1 — p)N(m\, o^), 
where mo, m\, ffg and a\ take values equal to the means 
and variances of the markers, respectively, and p is a ran- 
dom value uniformly distributed over [0, 1]. The D\ v low- 
variance non-markers are uncorrelated and have identical 
one-dimensional Gaussian distributions with parameters 
(mo, CTq ) [9,27]. Figure 1 represents the block-based struc- 
ture of the model. 
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Figure 1 Multivariate normal distribution model for generating synthetic gene expressions. This model is proposed by [9] 



NGS-reads and SAGE-tags models 

In NGS-type experiments, the gene-expression levels are 
measured by discrete values providing the number of 
reads that map to the respective gene. Several statisti- 
cal models have been proposed for representing NGS 
data. Those models are based on either negative bino- 
mial or Poisson distributions [4-6]. In what follows, we 
denote the read count for gene / for sample point i by 
Xu, where it is assumed that in an NGS experiment 
each lane has a single biological specimen belonging 
to either class 0 or class 1. Furthermore, we select a 
model where the number of reads for each gene is gen- 
erated from a Poisson distribution with a known param- 
eter. We calculate the expected number of reads (mean 
of the Poisson distribution) from the generalized linear 
model [28]: 



l0g(E[Xij\Si] ) = logs; + Xi,j + B{, 



(1) 



where ku is the /th gene-expression level in lane i. The 
term Bu represents technical effects that might be asso- 
ciated with an experiment. The term logs,- is an offset 
where s,, referred to as "sequencing depth" in the sta- 
tistical community [29], denotes a major factor in the 
transformation from expression levels to read data. It 
accounts for different total numbers of reads produced 
by each lane and plays an important role in normalizing 
the specimens across the flowcell. The trimmed mean of 
M values (TMM) [30], quantile normalization [28], and 
median count ratio [4] are three commonly used methods 
for estimating the sequencing depth. Equation (1) rep- 
resents the expected value of reads, conditioned on the 
sequencing depth, based on the linear combination of the 



factors that affect its value: the depth of sequencing, the 
gene-expression level and a general noise term. Therefore, 
it can be used to model the expected number of reads, as 
the mean of the Poisson distribution in our synthetic data 
generation pipelines. Rewriting equation (1) yields 



E[Xij\si] = si exp(Xij + Q it j), 



(2) 



indicating that if Xy and &u are normally distributed, then 
exp(kij + Bu) will have a log-normal distribution. There- 
fore, for a given s; the mean of Xu is log-normally dis- 
tributed. This phenomenon has been previously reported 
for microarray studies where the means of expression lev- 
els are shown to have log-normal distributions [23,31]. 
Furthermore, we assume that the offset s; is random and 
uniformly distributed [29]. Because the term By repre- 
sents the unknown technical effects associated with the 
experimentation, we assume that it follows a Gaussian 
distribution with zero mean and a variance set by the 
coefficient of variation (COV): 



9u~N(0,\mi -molCOV), 



(3) 



COV aims to model the unknown errors that can occur 
during an/a NGS/SAGE experiment, including basecall- 
ing, mapping reads to the reference genome, etc. The 
term E[Xu\Si] serves as the single parameter of a Poisson 
distribution that models the NGS/SAGE processes by gen- 
erating random non-negative integers, as the read counts 
or tag numbers data, having expected value equal to the 
right-hand side of equation (2). 

To generate synthetic datasets for the purposes of our 
simulation study we proceed as follows: for a sample point 
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/ in the experiment, we first randomly generate a num- 
ber Si from a uniform distribution, U(a, /3), where a > 0 
and p > a. Then, a random sample point, ku, is drawn 
from the MVN-GC model and its value is perturbed with 
Ou, which is drawn randomly according to its distribution 
defined in (3). Using (2), the mean of the Poisson distribu- 
tion is calculated for each sample point i and gene /, and 
a single random realization of Xu is generated. The pro- 
cesses of generating count data for RNA-Seq reads and 
SAGE tag numbers are very similar, but the total number 
of reads per NGS run is significantly more than tags for 
a SAGE experiment. Therefore, we only change a and /3 
to get the desired number of read counts or tags. We also 
assume that the SAGE experiments always have a fixed 
range for the total number of tags, whereas RNA-Seq has 
a variety of ranges for the total read counts. By having 
different ranges for the read counts, we can compare the 
performance of classification resulting from NGS-reads 
and SAGE-tags models under different experimental 
settings. 

Classification schemes 

The setup for the classification problem is determined by 
a joint feature-label probability distribution Fxy> where 
X e MP is a random feature vector and Y e {0, 1} is 
the unknown class label of X. In the context of genomics, 
the feature vector is usually the expression levels of many 
genes and the class labels are different types or stages 
of disease to which sample points belong to. A classi- 
fier rule model is defined as a pair S), where * 
is a classification rule, possibly including feature selec- 
tion, and 3 is a training-data error estimation rule. In 
a typical classification task, a random training set S n = 
{(X 1; Yi), (X 2 , Y 2 ), .... (X„, Y„)} is drawn from F XY and 
the goal is to design a classifier \jr n = *(<S„), which takes 
X as the input and outputs a label Y. The true classifica- 
tion error of i(r n is given by e n = P(i]/ n (X) ^ Y). The error 
estimation rule S provides an error estimate, e n = a{S n ), 
for i/r„. 

In this study, we consider linear discriminant analysis 
(LDA), three nearest neighbors (3NN) and radial basis 
function support vector machine (RBF-SVM) as the clas- 
sification rules, and report the true error of the classifiers. 
We implement t-test feature selection (as a part of the 
classification rule) before the classifier design procedure 
to select d < D features with highest i-scores. The train- 
ing set with d features is then used to design the classifier. 
The same d features are also used for finding the true error 
of the designed classifier. 

LDA is the plug-in rule for the Bayes classifier when 
the class-conditional densities are multivariate Gaussian 
with a common covariance matrix [32]. The sample means 
and pooled sample covariance matrix are estimated from 
the training data S n and plugged into the discriminant 



function. If the classes are equally likely, LDA assigns x to 
class 1 if and only if 

(x - Ai) r 2" 1 (x - Ai) < (x - Ao) r S" 1 (x - Ho), (4) 

where fly is the sample mean for class y e {0, 1}, and E is 
the pooled sample covariance matrix. 

3NN is a special case of /cNN rule (with k = 3), which is 
a nonparametric classification rule based on the training 
data. The /cNN classifier assigns a label, 0 or 1, to a point 
x according to the majority of the labels of the k near- 
est training data points to it. To avoid tied votes in binary 
classification problems, an odd number is usually chosen 
for k. 

A support vector machine finds a maximal margin 
hyperplane for a given set of training sample points. If 
it is not possible to linearly separate the data, one can 
introduce some slack variables in the optimization proce- 
dure that allow the mislabeled sample points and solve the 
dual problem. Alternatively, one can transform the data 
and project it into a higher-dimensional space, where it 
becomes linearly separable. The equivalent classifier back 
in the original feature space will generally be non-linear 
[33,34]. If a Gaussian radial basis function used as the ker- 
nel function, the corresponding classifier is referred to as 
RBF-SVM. 

Simulation setup and parameters 

Figure 2 presents an overview of the simulations per- 
formed in the study. In this section we provide the setup 



Multivariate Gaussian mRNA (gene) 
Concentrations 




I 

Modeling of the 
NGS Pipeline 



I 

Modeling of the 
SAGE Pipeline 



Feature Selection, Classifition 
and Error Calculation 



Figure 2 Three different types of data are generated: (1) 
MVN-GC; (2) NGS-reads; and (3) SAGE-tags. Data sets are 
generated according to the respective statistical models. Then, the 
data is fed to the feature selection, classification and error estimation 
module. 
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and list of parameters used in the study. The analysis of 
the results follows in the next section. 

To emulate real-world scenarios involving thousands of 
genes with only a few sample points, we choose D = 
20, 000 as the number of features and n e {60, 120, 180} as 
the number of sample points available for each synthetic 
feature-label distribution. Because there is no closed form 
expression to calculate the true error of the designed clas- 
sifiers, we generate a large independent test sample of size 
n t = 3000, with samples divided equally between the 
two classes. Because the RMS between the true and esti- 
mated error when using independent test data is bounded 
above by \/2<Jn~ t , this test sample size provides an error- 
estimate RMS of less than 0.01. Once the training and test 
data are generated, we normalize the training data so that 
each feature is zero-mean unit-variance across all sample 
points in both classes. We also apply the same normal- 
ization coefficients from the training data to the test set. 
The normalized data are then used in the feature selection, 
classifier design and error calculation. The parameter set- 
tings for the MVN-GC model, SAGE-tags and NGS-reads 
are provided in Table 1. 

As explained in the previous section, the datasets gen- 
erated from the MVN-GC model are transformed into the 
NGS-reads and SAGE-tags datasets through equation (2) 



Table 1 MVN-GC, SAGE-tags and NGS-reads models 
parameters 



Parameters 


Value 


Feature size D 


20000 


Training sample 


60,120,180 


size n 




Test sample size n t 


3000 


Class 0 (m 0 ,cro) 


(0.0, 0.6) 


Class 1 (/t?],<ti) 


(1.0, 0.6) 


Correlation p 


0.4, 0.8 


Block size / 


5 


Global markers D gm 


10 


Subclasses c 


2 


Heterogenous 


50 


markers per 




subclass Dh m 




High-variance 


2000 


non-markers Dh v 




Low-variance 


17890 


non-markers D| v 




COV 


0.05, 0.1 


SAGE-tags range 


50K-100K 


NGS-reads range 


1 K-50K, 250K-300K, 500K-550K 




5M-5M+50K, 1 0M-1 0M+50K, 1 5M-1 5M+50K 




25M-25M+50K, 32.5M-32.5M+50K, 40M-40M+50K 




50M-50M+50K, 75M-75M+50K, 100M-100M+50K 



and Poisson processes. We only need to properly set the 
parameters COV, a, and /J to get the desired number 
of read counts or tags. We assume that the parameter 
COV can take on two values, 0.01 and 0.05, representing 
two different levels of noise and unknown errors in the 
experiment. In its current state, RNA-Seq technology can 
provide different numbers of reads per sample, depend- 
ing on many factors, such as quality of the sample, the 
desired coverage, sample multiplexing, etc. In this study, 
we examine a variety of ranges for RNA-Seq experiments. 
We start with a very low total number of RNA-Seq reads, 
which may not match the real-world experiments, how- 
ever, they are necessary for comparing the SAGE-tags and 
NGS-reads models with similar coverage. Furthermore, 
demonstrating the classification results for the NGS-reads 
model with wide ranges introduces an extra internal vari- 
ability, which makes interpretations of the results rather 
difficult. Table 1 lists the NGS-reads ranges we have con- 
sidered in this study. In a typical SAGE experiment, one 
expects 50K to 100K tags [35,36]. Using trial and error, we 
have found that by choosing the parameters a = 2.0 and 
P = 3.75 in the distribution of s,-, the observed number of 
tags usually falls within this range. Similarly, the parame- 
ters a and /3 are chosen to meet the range requirements in 
the NGS-reads model. 

Our goal is to study the performance of the traditional 
classification schemes on different sources of random 
samples; thus, we take a Monte-Carlo approach and gen- 
erate 10, 000 random training sets of size n from the 
MVN-GC model, transform them to the corresponding 
NGS-reads and SAGE-tags samples, and apply the classi- 
fication schemes to each training sample. By taking such 
an approach, we aim to compare the classification perfor- 
mance of the three pipelines, in terms of the true error 
of the deigned classifiers. We also study the effect of 
NGS-reads and SAGE-tags transformations on the per- 
formance of a simple i-test biomarker discovery method, 
where we report the probability that global markers are 
recovered when d <5C D features are selected after the 
feature-selection step. 

Results 

The probability of recovering a certain number of global 
markers after a i-test feature selection can be approx- 
imated empirically by the percentage of experiments 
(out of 10,000 independent experiments) that detect 
such a number of markers. This probability depends 
on the size of the training data sample, quality of fea- 
tures, and underlying joint probability distribution of 
the features. Here, we only show the results for the 
MVN-GC model, with d = 10, in Table 2. Tables 3, 
4, 5 and 6 represent the corresponding results for 
NGS-reads for different combinations of COV and p: 
{0.05, 0.4}, {0.05, 0.8}, {0.1, 0.4}, {0.1, 0.8}, respectively. In 
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Table 2 Percentage of the MVN-GC experiments 
recovering certain number of global markers after feature 
selection 



Correlation 


Sample 
size 


<8 
markers 


8 

markers 


9 

markers 


10 
markers 




n = 60 


5.4733 


1 5.8242 


404508 


38.2517 


p = 0.4 


n = 120 


0.0108 


0.2375 


6.1483 


93.6033 




n = 180 


0.0000 


0.0017 


0.3367 


99.6617 




n = 60 


9.5742 


9.7425 


22.0408 


58.6425 


p = 0.8 


n = 120 


0.1742 


0.4692 


2.6925 


96.6642 




n = 180 


0.0025 


0.0175 


0.1933 


99.7867 



each table, we also report the corresponding results for 
the SAGE-tags model in a row with the NGS -reads range 
of [50K — 10QK"]. A successful feature selection should 
identify as many global markers as possible, however the 
situation is worsened because with small sample sizes the 
noisy features sometimes may appear to be good among 
the 20,000 features. As the number of sample points 
increases, we expect to get better results for the feature 
selection and this is exactly what we see in Tables 2, 3, 
4, 5 and 6. Another important observation in Tables 3, 
4, 5 and 6 is the effect of the total number of reads and 
COV for the NGS-reads models. As the number of reads 
increases, it is more likely to pick up more global mark- 
ers, until the number of reads reaches a threshold, where 
no further improvement is observed. Similarly, for smaller 
COV the probability of selecting more global markers also 
increases. 

The true error of a designed classifier measures the gen- 
eralization capability of the classifier on a future sample 
point. Given a set of training sample points and a classi- 
fication rule, one needs the full feature-label probability 
distribution to calculate the true error of the classifier 
designed on the training set. In this study, we find the 
true error of a classifier with a large independent sam- 
ple. Because the training sample is random, the true 
error is a random variable with its own probability dis- 
tribution. Therefore, to demonstrate the actual perfor- 
mance of the designed classifiers, the estimated proba- 
bility density function (PDF) of the true error for each 
classification rule, distribution model and all data gener- 
ation models (MVN-GC, NGS-reads and SAGE-tags) is 
presented. 

We report the true errors of the designed classifiers for 
two different feature-selection schemes and classification 
rules. In the first scheme, no feature selection is per- 
formed and the first d global markers are directly used for 
classification. In the second scheme, i-test feature selec- 
tion is done before a classifier is designed to select the 
best d features. Figures 3, 4 and 5 show the salient find- 
ing of this study. We present the PDF of the true error 



Table 3 Percentage of the NGS-reads experiments 
recovering certain number of global markers after feature 
selection for COV=0.05 and p = 0.4 



NGS-reads 


Sample 


<8 


8 


9 


10 


range 


size 


markers 


markers 


markers 


markers 




n = 


60 


99.64 


0.34 


0.02 


0.00 


1 K-50K 


n = 


120 


69.98 


20.48 


8.64 


0.90 




n = 


180 


27.84 


33.76 


31.20 


7.20 




n = 


60 


77.30 


16.54 


5.60 


0.56 


50K-100K (SAGE-tags) 


n = 


120 


10.03 


26.12 


44.08 


19.77 




n = 


180 


0.88 


7.53 


39.10 


5249 




n = 


60 


41.30 


32.44 


22.00 


4.26 


250K-300K 


n = 


120 


2.10 


11.82 


41.56 


44.52 




n = 


180 


0.10 


1.74 


22.88 


75.28 




n = 


60 


34.54 


32.74 


26.22 


6.50 


500K-550K 


n = 


120 


1.82 


1 1.08 


41.56 


45.54 




n = 


180 


0.10 


1.56 


20.26 


78.08 




n = 


60 


30.98 


31.36 


30.04 


7.62 


5M-5M+50K 


n = 


120 


1.24 


8.72 


37.56 


52.48 




n = 


180 


0.04 


1.20 


18.92 


79.84 




n = 


60 


29.74 


32.68 


29.16 


8.42 


10M-10M+50K 


n = 


120 


1.22 


8.28 


37.36 


53.14 




n = 


180 


0.04 


1.00 


18.84 


80.12 




n = 


60 


28.96 


33.16 


29.92 


7.96 


15M-15M+50K 


n = 


120 


1.06 


8.06 


38.38 


52.50 




n = 


180 


0.06 


1.36 


18.08 


80.50 




n = 


60 


28.90 


32.36 


30.38 


8.36 


25M-25M+50K 


n = 


120 


0.98 


8.00 


40.12 


50.90 




n = 


180 


0.06 


1.14 


17.68 


81.12 




n = 


60 


28.60 


32.50 


30.42 


8.48 


32.5M-32.5M+50K 


n = 


120 


1.10 


7.48 


39.02 


52.40 




n = 


180 


0.02 


1.02 


17.80 


81.16 




n = 


60 


30.20 


31.46 


29.40 


8.94 


40M-40M+50K 


n = 


120 


1.34 


8.50 


38.12 


52.04 




n = 


180 


0.08 


1.02 


18.62 


80.28 




n = 


60 


28 84 


32 1 8 


30 32 


8 66 


50M-50M+50K 


n = 


120 


1.22 


8.46 


39.02 


51.30 




n = 


180 


0.04 


1.28 


17.24 


81.44 




n = 


60 


29.12 


32.50 


29.72 


8.66 


75M-75M+50K 


n = 


120 


1.08 


7.98 


39.64 


51.30 




n = 


180 


0.02 


1.16 


17.14 


81.68 




n = 


60 


29.14 


32.16 


30.84 


7.86 


100M-100M+50K 


n = 


120 


1.22 


8.78 


39.10 


50.90 




n = 


180 


0.08 


1.22 


18.32 


80.38 
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Table 4 Percentage of the NGS-reads experiments 
recovering certain number of global markers after feature 
selection for COV=0.05 and p = 0.8 



NGS-reads 


Sample 


<8 


8 


9 


10 


range 


size 


markers 


markers 


markers 


markers 




n = 


: 60 


99.02 


0.80 


0.18 


0.00 


1 K-50K 


n = 


120 


66.16 


20.68 


1 1.20 


1.96 




n = 


180 


28.54 


28.38 


30.84 


12.24 




n = 


: 60 


71.61 


17.10 


9.28 


2.01 


50K-100K (SAGE-tags) 


n = 


120 


12.85 


20.14 


36.80 


30.21 




n = 


180 


2.30 


7.43 


28.67 


61.60 




n = 


: 60 


40.94 


22.42 


24.12 


12.52 


250K-300K 


n = 


120 


5.52 


9.46 


28.58 


56.44 




n = 


180 


0.70 


1.86 


14.14 


83.30 




n = 


: 60 


34.48 


21.02 


27.56 


16.94 


500K-550K 


n = 


120 


4.84 


8.16 


24.20 


62.80 




n = 


180 


0.78 


2.16 


1 1.38 


85.68 




n = 


: 60 


31.26 


20.80 


26.64 


21.30 


5M-5M+50K 


n = 


120 


3.98 


6.62 


21.78 


67.62 




n = 


180 


0.66 


2.24 


10.72 


86.38 




n = 


: 60 


30.04 


20.86 


26.84 


22.26 


10M-10M+50K 


n = 


120 


3.54 


7.24 


22.06 


67.16 




n = 


180 


0.88 


1.44 


9.72 


87.96 




n = 


: 60 


30 08 


20 68 


27 92 


21 32 


15M-15M+50K 


n = 


120 


4.3Z 


7 1 n 

/ . I u 


ZZ.4Z 


DD. I 0 




n = 


180 


U.oU 


1 7") 

I ./ z 


y.oz 


QQ HA 
OO.UD 




n = 


: 60 


30.86 


19.92 


27.00 


22.22 


zjlvrz J IV|-rjUr\ 


n = 


120 


3.56 


7.46 


21.90 


67.08 




n = 


180 


0.58 


1.90 


8.90 


88.62 




n 


: 60 


30.50 


20.20 


27.50 


21.80 


32.5M-32.5M+50K 


■ — 


120 


3.62 


7.00 


21.66 


67.72 




n = 


180 


0.80 


1.88 


10.22 


87.10 




n = 


■ 60 


31.14 


18.40 


28.14 


22.32 


40M-40M+50K 


n = 


120 


3.90 


7.12 


22.44 


66.54 




n = 


180 


0.60 


1.90 


10.12 


87.38 




n = 


: 60 


30.92 


19.34 


27.20 


22.54 


50M-50M+50K 


n = 


120 


3.58 


6.68 


22.24 


67.50 




n = 


180 


0.68 


1.66 


10.22 


87.44 




n = 


: 60 


30.74 


19.00 


27.92 


22.34 


75M-75M+50K 


n = 


120 


3.68 


7.64 


23.46 


65.22 




n = 


180 


0.50 


1.90 


10.44 


87.16 




n = 


: 60 


31.06 


20.32 


26.70 


21.92 


100M-100M+50K 


n = 


120 


3.84 


6.48 


21.30 


68.38 




n = 


180 


0.86 


1.80 


9.94 


87.40 



Table 5 Percentage of the NGS-reads experiments 
recovering certain number of global markers after feature 



selection for COV= 


=0.1 


and 


p = 0.4 








NGS-reads 


Sample 


<8 


8 


9 


10 


range 


size 


markers 


markers 


markers 


markers 




n = 


60 


99.62 


0.36 


0.02 


0.00 


I K-5UK 


n = 


120 


70.26 


20.64 


8.02 


1.08 




n = 


180 


30.16 


32.46 


29.80 


7.58 




n = 


60 


78.67 


15.75 


5.08 


0.50 


50K- 1 OUK pAbb-tags) 


n = 


120 


10.75 


24.12 


43.45 


18.68 




n = 


180 


1.01 


8.32 


40.50 


50.17 




n = 


60 


44.64 


31.70 


20.00 


3.66 


25lm-3UUI\ 


n = 


120 


2.60 


13.76 


44.24 


39.40 




n = 


180 


0.10 


2.08 


23.84 


73.98 




n = 


60 


37.16 


32.54 


24.48 


5.82 




n = 


120 


1.92 


1 1.42 


42.50 


44.16 




n = 


180 


0.06 


1.80 


21.98 


76.16 




n = 


60 


32.36 


31.70 


28.84 


7.10 


5IVI-5(VI+5Ur\ 


n = 


120 


1.46 


9.38 


41.20 


47.96 




n = 


180 


0.08 


1.68 


20.06 


78.18 




n = 


60 


33.04 


31.94 


27.78 


7.24 


1 UM- 1 UIVI-t-5UI\ 


n = 


120 


1.52 


9.34 


41.32 


47.82 




n = 


180 


0.10 


1.48 


19.44 


78.98 




n = 


60 


31.92 


32.64 


28.00 


7.44 


1 5IVI- 1 5IV1+5UI\ 


n = 


120 


1.40 


8.36 


41.90 


48.34 




n = 


180 


0.08 


1.42 


21.20 


77.30 




n 


60 


31.82 


32.04 


28.36 


7.78 


25(VI-z5IVI+5UI\ 


■ = 


120 


1.48 


9.32 


40.86 


48.34 




n = 


180 


0.02 


1.38 


20.40 


78.20 




n = 


60 


32.06 


32.44 


27.88 


7.62 


32.5M-32.5IVI+5Ur\ 


n = 


120 


1.42 


9.44 


40.50 


48.64 




n = 


180 


0.06 


1.32 


19.90 


78.72 




n = 


60 


33.18 


31.84 


27.20 


7.78 


4U(VI-4UIVH-5UI\ 


n = 


120 


1.48 


9.38 


41.04 


48.10 




n = 


180 


0.12 


1.60 


20.12 


78.16 




n = 


60 


31.90 


31.94 


29.42 


6.74 


50M-50M+50K 


n = 


120 


1.54 


8.76 


41.08 


48.62 




n = 


180 


0.02 


1.48 


20.72 


77.78 




n = 


60 


32.58 


31.10 


28.38 


7.94 


75M-75M+50K 


n = 


120 


1.22 


8.78 


39.94 


50.06 




n = 


180 


0.02 


1.34 


19.84 


78.80 




n = 


60 


31.40 


32.66 


28.90 


7.04 


100M-100M+50K 


n = 


120 


1.50 


9.08 


40.38 


49.04 




n 


180 


0.08 


1.36 


20.02 


78.54 



Ghaffari etal. BMC Bioinformatics 2013, 14:307 
http://www.biomedcentral.eom/1 471 -21 05/1 4/307 



Page 9 of 14 



Table 6 Percentage of the NGS-reads experiments 
recovering certain number of global markers after feature 



selection for COV: 


=0.1 


and 


p = 0.8 








NGS-reads 


Sample 


<8 


8 


9 


10 


range 


size 


mdiKcib 


ITIdlKcrb 


marKcrb 


mdlKcii 




n = 


60 


99.40 


0.52 


0.08 


0.00 


1 K-50K 


n = 


120 


68.14 


19.18 


10.92 


1.76 




n = 


180 


29.54 


28.1 6 


30.44 


1 1 .86 




n = 


60 


73.14 


16.55 


8.57 


1.74 


50K-100K(SAGE-tags) 


n = 


120 


14.16 


21.13 


36.43 


28.28 




n = 


180 


2.52 


8.03 


30.1 6 


59.29 




n = 


60 


44.26 


21.56 


23.10 


1 1.08 


250K-300K 


n = 


120 


5.54 


10.34 


29.88 


54.24 




n = 


180 


0.84 


3.06 


1 5.92 


80.1 8 




n = 


60 


36.74 


22.00 


26.16 


15.10 


500K-550K 


n = 


120 


4.74 


9.10 


28.14 


58.02 




n = 


180 


0.56 


2.66 


13.68 


83.10 




n = 


60 


32.76 


20.90 


26.96 


19.38 


5M-5M+50K 


n — 


120 


4.38 


7.30 


24.74 


63.58 




n = 


180 


0.62 


2.08 


1 1.92 


85.38 




n = 


60 


32.56 


20.06 


26.84 


20.54 


10M-10M+50K 


n — 


120 


4.38 


8.72 


23.72 


63.18 




n = 


180 


0 70 


2 1 8 


1 1 62 


85 50 




n = 


60 


31 90 


21 54 


27 70 


18 86 


15M-15M+50K 


n — 


120 


4.46 


8.16 


23.28 


64.10 




n = 


180 


0 76 


2 1 2 


1 2 02 


85 1 0 




n = 


60 


32 56 


21 02 


26 90 


1 9 52 


25M-25M+50K 


n = 


120 


3.96 


8.44 


24.30 


63.30 




n — 


180 




9 39 


1 1 R£ 
I I .OD 


8^ 98 

OJ.ZO 




n = 


60 


33 on 


91 3D 


26 74 




32.5M-32.5M+50K 


n = 


120 


4.18 


7.20 


24.10 


64.52 




n = 


180 


0 64 


1 98 


1 0 94 


86 44 




n = 


60 


32.94 


21 .00 


27.26 


18.80 


40M-40M+50K 


n = 


120 


4.36 


7.48 


24.22 


63.94 




n — 


180 


0 80 


2 24 


1116 


85 80 




n = 


60 


32 70 


20 64 


26 70 


1 9 96 


50M-50M+50K 


n = 


120 


4.14 


8.52 


23.42 


63.92 




n = 


180 


0.74 


2.32 


11.84 


85.10 




n = 


60 


33.38 


20.34 


27.02 


19.26 


75M-75M+50K 


n = 


120 


4.48 


7.90 


24.26 


63.36 




n = 


180 


0.60 


1.92 


11.00 


86.48 




n = 


60 


32.70 


21.34 


26.86 


19.10 


100M-100M+50K 


n = 


120 


4.54 


7.88 


23.64 


63.94 




n 


180 


0.80 


2.58 


11.64 


84.98 



for different classification rules trained on 60 and 180 
sample points when the correlation among features in the 
same block is high (p = 0.8) and COV= 0.05. Distribu- 
tions with higher and tighter densities around lower true 
errors indicate better classifier performance. If the PDFs 
can be approximated as univariate Gaussians, then a good 
classification performance amounts to smaller mean and 
variance. In all cases, for similar sample sizes and sim- 
ilar settings for p and COV, MVN-GC outperforms the 
two other data types. This is not surprising since it is 
considered as the ground truth. Also, it is evident that 
a larger sample size gives better classifiers with smaller 
variance as illustrated by the distribution of the true 
error. 

Figures 3, 4 and 5 show that utilizing the best d features 
(with or without feature selection) in the model, SAGE- 
tags and NGS-reads for small ranges of the total reads 
yield similar results (or even much worse when the range 
is smaller) in terms of the classification performance, and 
both are inferior when compared to the ground truth. 
This may lead to a conclusion that one should not expect 
improvements in classification performance when gene- 
expression levels are processed through an NGS-reads 
pipeline. However, our main objective here is to show 
that as one increases the total number of reads for the 
NGS-reads model, improvement can be achieved and the 
error rates decrease. This conclusion confirms the intu- 
ition provided by a simple calculation about the increase 
of the separability of two Poisson random variables with 
means proportional to the number of reads. However, 
notice that the modeling of the sequencing pipeline intro- 
duces randomness in the means of the respective Poisson 
parameters describing the individual gene reads. More- 
over, our focus is not that much on the separability of 
the two classes/phenotypes but rather on the classification 
performance as measured by the classification error and 
its dependence on ground truth (MVN-GC model) sam- 
ple size, sequencing depth, and feature vectors. Our goal 
for modeling NGS-reads with small ranges is to demon- 
strate the performance of SAGE and RNA-Seq, when their 
data is similar. This result suggests that having a larger 
number of reads for the RNA-Seq experiments could 
compensate for the errors that can occur during the NGS- 
reads pipeline sample processing. Here we have shown 
the results only for COV = 0.05, since in our obser- 
vations, it appears that changing COV has little effect 
on the distribution of the true error for both SAGE-tags 
and NGS-reads models, except that it slightly increases 
the variance of the PDFs. The effect of feature selec- 
tion (right columns) on the performance is best shown 
when the sample size is small. In this case, the variance 
of the true error is so large that drawing any conclusion 
regarding the performance on a small sample would be 
futile. Another interesting observation is that, on average, 
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0.15 0.2 

True error 



0.15 0.2 

True error 






— MVN-GC, sample size 60, EfTrue error]=0.13^04 
■ ■■ NGS-reads, sample size 60, E[True error]=0.16801 
SAGE-tags, sample size 60, EfTrue error]=0.20172 

— MVN-GC, sample size 180, EfTrue error]=0.10918 
NGS-reads, sample size 180, E[True error]=0.14675 
SAGE-tags, sample size 180, EfTrue error]=0. 17763 



0.15 0.2 

True error 




0.15 0.2 

True error 



Figure 3 Probability density estimate of the true error for LDA classification rule, without feature selection (left column, d = 1 0 global 
markers are directly used), with f-test feature selection (right column), p = 0.8, COV= 0.05, £\ X,j e (50K, 1 00K) for SAGE-tags and the 
following total number of reads for NGS-reads: (a,b) £ y Xy e (1 K, 50K); (c,d) £ ; X u e (250K, 300K); (e,f) X u e (25M, 25M + 50K); 
(9,h) Y,jXij e (50M,50M + 50K). 
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Figure 4 Probability density estimate of the true error for 3NN classification rule, without feature selection (left column, d = 1 0 global 
markers are directly used), with t-test feature selection (right column), p = 0.8, COV= 0.05,£.Jfy e (50K, 100K) for SAGE-tags and the 
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Figure 5 Probability density estimate of the true error for RBF-SVM classification rule, without feature selection (left column, d = 1 0 
global markers are directly used), with f-test feature selection (right column), p = 0.8, COV= 0.05, £iXy 6 (50K, 100K) for SAGE-tags 
and the following total number of reads for NGS-reads: (a,b)£yXy e (1K,50K); (c,d) T,j x ij e (250K,300K); (e,f) Y.j x Q e (25M, 
25M + 50K); (g,h) J^Xy e (50M, 50M + 50K). 
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3NN and RBF-SVM classification rules outperform LDA 
for both NGS-reads and SAGE-tags data, supporting the 
idea that using linear classifiers for these types of data 
is not the best way to proceed in a highly non-Gaussian 
model - as is the case after the data have gone through 
the pipeline. The best rates for the expected true error 
across all settings are achieved by the RBF-SVM classifi- 
cation rule, even for small samples, especially with small 
variance. 

Conclusions 

In this paper, we model gene-expression levels as a mul- 
tivariate Gaussian distribution that statistically captures 
the real mRNA levels within the cells. The newly devel- 
oped technologies of sequencing DNA/RNA, referred to 
as NGS, and their ascendant SAGE technology generate 
discrete measures for gene expressions. The count data 
from these technologies can be modeled with a Pois- 
son distribution. The multivariate Gaussian gene expres- 
sions are transformed through a Poisson filter to model 
NGS and SAGE technologies. The three categories of 
data are subjected to feature selection and classifica- 
tion. The objective is to evaluate the performance of 
the NGS technologies in classification. Our simulations 
show that when the gene expressions are directly used 
in classification, the best performance in terms of the 
classification error is achieved. The NGS-reads model 
generates considerably higher coverage for genes and can 
outperform SAGE in classification, when the experiment 
generates large number of reads. Even though NGS still 
has a variety of error sources involved in its process, 
its high volume of reads for a specific gene can lower 
the chance of mis classification. Thus, it is important to 
use the highest possible coverage for the entire genome 
while performing a RNA-Seq analysis if the goal of the 
study is to distinguish the samples of interest. Never- 
theless, one must recognize that, as is typical with non- 
linear transformations, the NGS pipeline transforms the 
original Gaussian data in such a way as to increase clas- 
sification difficulty. As more refined modeling becomes 
available for the NGS pipeline, further study needs to 
performed, as has been done in the case of the LC-MS 
proteomics pipeline [1], to determine which segments of 
the pipeline are most detrimental to classification and 
what, if anything, can be done to mitigate classification 
degradation. 
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